%%html
<link rel="stylesheet" type="text/css" href="rise.css" />
Pandas dataframes (tables)¶
Learning goals
- You will be able to to work with tabular data sets.
- You will appreciate how useful pandas is for exploratory data analysis.
import pandas as pd
/var/folders/cq/0x3wj8sx1_dbbm61zp0zqbmm0000gn/T/ipykernel_94698/4080736814.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
import pandas as pd
Data_Cortex_Nuclear.csv
Load a spreadsheet from file into a Pandas DataFrame.
df = pd.read_csv("data/Data_Cortex_Nuclear.csv")
df
| MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 82 columns
df.shape
(1080, 82)
Column titles.
df.columns
Index(['MouseID', 'DYRK1A_N', 'ITSN1_N', 'BDNF_N', 'NR1_N', 'NR2A_N', 'pAKT_N',
'pBRAF_N', 'pCAMKII_N', 'pCREB_N', 'pELK_N', 'pERK_N', 'pJNK_N',
'PKCA_N', 'pMEK_N', 'pNR1_N', 'pNR2A_N', 'pNR2B_N', 'pPKCAB_N',
'pRSK_N', 'AKT_N', 'BRAF_N', 'CAMKII_N', 'CREB_N', 'ELK_N', 'ERK_N',
'GSK3B_N', 'JNK_N', 'MEK_N', 'TRKA_N', 'RSK_N', 'APP_N', 'Bcatenin_N',
'SOD1_N', 'MTOR_N', 'P38_N', 'pMTOR_N', 'DSCR1_N', 'AMPKA_N', 'NR2B_N',
'pNUMB_N', 'RAPTOR_N', 'TIAM1_N', 'pP70S6_N', 'NUMB_N', 'P70S6_N',
'pGSK3B_N', 'pPKCG_N', 'CDK5_N', 'S6_N', 'ADARB1_N', 'AcetylH3K9_N',
'RRP1_N', 'BAX_N', 'ARC_N', 'ERBB4_N', 'nNOS_N', 'Tau_N', 'GFAP_N',
'GluR3_N', 'GluR4_N', 'IL1B_N', 'P3525_N', 'pCASP9_N', 'PSD95_N',
'SNCA_N', 'Ubiquitin_N', 'pGSK3B_Tyr216_N', 'SHH_N', 'BAD_N', 'BCL2_N',
'pS6_N', 'pCFOS_N', 'SYP_N', 'H3AcK18_N', 'EGR1_N', 'H3MeK4_N',
'CaNA_N', 'Genotype', 'Treatment', 'Behavior', 'class'],
dtype='object')
Each row has an associated index.
df.index
RangeIndex(start=0, stop=1080, step=1)
Row indexes are just 0-based indices.
df
| MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 82 columns
Row indexes do NOT have to be numeric or in any particular order, but they must be unique.
df.index = [f"Mouse {i}" for i in df.index]
df
| MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 82 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 1080 entries, Mouse 0 to Mouse 1079 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MouseID 1080 non-null object 1 DYRK1A_N 1077 non-null float64 2 ITSN1_N 1077 non-null float64 3 BDNF_N 1077 non-null float64 4 NR1_N 1077 non-null float64 5 NR2A_N 1077 non-null float64 6 pAKT_N 1077 non-null float64 7 pBRAF_N 1077 non-null float64 8 pCAMKII_N 1077 non-null float64 9 pCREB_N 1077 non-null float64 10 pELK_N 1077 non-null float64 11 pERK_N 1077 non-null float64 12 pJNK_N 1077 non-null float64 13 PKCA_N 1077 non-null float64 14 pMEK_N 1077 non-null float64 15 pNR1_N 1077 non-null float64 16 pNR2A_N 1077 non-null float64 17 pNR2B_N 1077 non-null float64 18 pPKCAB_N 1077 non-null float64 19 pRSK_N 1077 non-null float64 20 AKT_N 1077 non-null float64 21 BRAF_N 1077 non-null float64 22 CAMKII_N 1077 non-null float64 23 CREB_N 1077 non-null float64 24 ELK_N 1062 non-null float64 25 ERK_N 1077 non-null float64 26 GSK3B_N 1077 non-null float64 27 JNK_N 1077 non-null float64 28 MEK_N 1073 non-null float64 29 TRKA_N 1077 non-null float64 30 RSK_N 1077 non-null float64 31 APP_N 1077 non-null float64 32 Bcatenin_N 1062 non-null float64 33 SOD1_N 1077 non-null float64 34 MTOR_N 1077 non-null float64 35 P38_N 1077 non-null float64 36 pMTOR_N 1077 non-null float64 37 DSCR1_N 1077 non-null float64 38 AMPKA_N 1077 non-null float64 39 NR2B_N 1077 non-null float64 40 pNUMB_N 1077 non-null float64 41 RAPTOR_N 1077 non-null float64 42 TIAM1_N 1077 non-null float64 43 pP70S6_N 1077 non-null float64 44 NUMB_N 1080 non-null float64 45 P70S6_N 1080 non-null float64 46 pGSK3B_N 1080 non-null float64 47 pPKCG_N 1080 non-null float64 48 CDK5_N 1080 non-null float64 49 S6_N 1080 non-null float64 50 ADARB1_N 1080 non-null float64 51 AcetylH3K9_N 1080 non-null float64 52 RRP1_N 1080 non-null float64 53 BAX_N 1080 non-null float64 54 ARC_N 1080 non-null float64 55 ERBB4_N 1080 non-null float64 56 nNOS_N 1080 non-null float64 57 Tau_N 1080 non-null float64 58 GFAP_N 1080 non-null float64 59 GluR3_N 1080 non-null float64 60 GluR4_N 1080 non-null float64 61 IL1B_N 1080 non-null float64 62 P3525_N 1080 non-null float64 63 pCASP9_N 1080 non-null float64 64 PSD95_N 1080 non-null float64 65 SNCA_N 1080 non-null float64 66 Ubiquitin_N 1080 non-null float64 67 pGSK3B_Tyr216_N 1080 non-null float64 68 SHH_N 1080 non-null float64 69 BAD_N 867 non-null float64 70 BCL2_N 795 non-null float64 71 pS6_N 1080 non-null float64 72 pCFOS_N 1005 non-null float64 73 SYP_N 1080 non-null float64 74 H3AcK18_N 900 non-null float64 75 EGR1_N 870 non-null float64 76 H3MeK4_N 810 non-null float64 77 CaNA_N 1080 non-null float64 78 Genotype 1080 non-null object 79 Treatment 1080 non-null object 80 Behavior 1080 non-null object 81 class 1080 non-null object dtypes: float64(77), object(5) memory usage: 700.3+ KB
Some statistics for each column.
df.describe()
| DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | pELK_N | ... | SHH_N | BAD_N | BCL2_N | pS6_N | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | 1077.000000 | ... | 1080.000000 | 867.000000 | 795.000000 | 1080.000000 | 1005.000000 | 1080.000000 | 900.000000 | 870.000000 | 810.000000 | 1080.000000 |
| mean | 0.425810 | 0.617102 | 0.319088 | 2.297269 | 3.843934 | 0.233168 | 0.181846 | 3.537109 | 0.212574 | 1.428682 | ... | 0.226676 | 0.157914 | 0.134762 | 0.121521 | 0.131053 | 0.446073 | 0.169609 | 0.183135 | 0.205440 | 1.337784 |
| std | 0.249362 | 0.251640 | 0.049383 | 0.347293 | 0.933100 | 0.041634 | 0.027042 | 1.295169 | 0.032587 | 0.466904 | ... | 0.028989 | 0.029537 | 0.027417 | 0.014276 | 0.023863 | 0.066432 | 0.059402 | 0.040406 | 0.055514 | 0.317126 |
| min | 0.145327 | 0.245359 | 0.115181 | 1.330831 | 1.737540 | 0.063236 | 0.064043 | 1.343998 | 0.112812 | 0.429032 | ... | 0.155869 | 0.088305 | 0.080657 | 0.067254 | 0.085419 | 0.258626 | 0.079691 | 0.105537 | 0.101787 | 0.586479 |
| 25% | 0.288121 | 0.473361 | 0.287444 | 2.057411 | 3.155678 | 0.205755 | 0.164595 | 2.479834 | 0.190823 | 1.203665 | ... | 0.206395 | 0.136424 | 0.115554 | 0.110839 | 0.113506 | 0.398082 | 0.125848 | 0.155121 | 0.165143 | 1.081423 |
| 50% | 0.366378 | 0.565782 | 0.316564 | 2.296546 | 3.760855 | 0.231177 | 0.182302 | 3.326520 | 0.210594 | 1.355846 | ... | 0.224000 | 0.152313 | 0.129468 | 0.121626 | 0.126523 | 0.448459 | 0.158240 | 0.174935 | 0.193994 | 1.317441 |
| 75% | 0.487711 | 0.698032 | 0.348197 | 2.528481 | 4.440011 | 0.257261 | 0.197418 | 4.481940 | 0.234595 | 1.561316 | ... | 0.241655 | 0.174017 | 0.148235 | 0.131955 | 0.143652 | 0.490773 | 0.197876 | 0.204542 | 0.235215 | 1.585824 |
| max | 2.516367 | 2.602662 | 0.497160 | 3.757641 | 8.482553 | 0.539050 | 0.317066 | 7.464070 | 0.306247 | 6.113347 | ... | 0.358289 | 0.282016 | 0.261506 | 0.158748 | 0.256529 | 0.759588 | 0.479763 | 0.360692 | 0.413903 | 2.129791 |
8 rows × 77 columns
Let's get rid of all the _N in the column titles to make things easier to look at.
coltitles = list(df.columns)
for i in range(len(coltitles)):
if coltitles[i].endswith('_N'):
coltitles[i] = coltitles[i][:-2]
df.columns = coltitles
df
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 82 columns
Index/slice into the rows of the DataFrame just like a NumPy array.
df[:5] # first 5 rows
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
5 rows × 82 columns
df[-9::2] # last 9 rows every other
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 1071 | J3295_7 | 0.318448 | 0.509701 | 0.290576 | 2.314752 | 3.041115 | 0.229135 | 0.175855 | 2.881429 | 0.206036 | ... | 0.127265 | 0.405127 | 0.307199 | 0.192350 | 0.269951 | 1.340743 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1073 | J3295_9 | 0.280946 | 0.528821 | 0.299645 | 2.290991 | 3.243548 | 0.246021 | 0.195797 | 2.946067 | 0.221759 | ... | 0.105369 | 0.408533 | 0.277181 | 0.182646 | 0.237009 | 1.297315 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
5 rows × 82 columns
Get a specific column by name.
df['Genotype']
Mouse 0 Control
Mouse 1 Control
Mouse 2 Control
Mouse 3 Control
Mouse 4 Control
...
Mouse 1075 Ts65Dn
Mouse 1076 Ts65Dn
Mouse 1077 Ts65Dn
Mouse 1078 Ts65Dn
Mouse 1079 Ts65Dn
Name: Genotype, Length: 1080, dtype: object
Get multiple columns by name in any order.
df[['Genotype', 'BDNF']]
| Genotype | BDNF | |
|---|---|---|
| Mouse 0 | Control | 0.430175 |
| Mouse 1 | Control | 0.411770 |
| Mouse 2 | Control | 0.418309 |
| Mouse 3 | Control | 0.358626 |
| Mouse 4 | Control | 0.358802 |
| ... | ... | ... |
| Mouse 1075 | Ts65Dn | 0.254860 |
| Mouse 1076 | Ts65Dn | 0.251638 |
| Mouse 1077 | Ts65Dn | 0.234118 |
| Mouse 1078 | Ts65Dn | 0.243974 |
| Mouse 1079 | Ts65Dn | 0.256564 |
1080 rows × 2 columns
- Single column $\rightarrow$ Series object
- Multiple columns $\rightarrow$ DataFrame oject
type(df['Genotype']), type(df[['Genotype', 'BDNF']])
(pandas.core.series.Series, pandas.core.frame.DataFrame)
But you can still get a single column as a DataFrame object if you index with a list that has just a single column title.
df[['Genotype']]
| Genotype | |
|---|---|
| Mouse 0 | Control |
| Mouse 1 | Control |
| Mouse 2 | Control |
| Mouse 3 | Control |
| Mouse 4 | Control |
| ... | ... |
| Mouse 1075 | Ts65Dn |
| Mouse 1076 | Ts65Dn |
| Mouse 1077 | Ts65Dn |
| Mouse 1078 | Ts65Dn |
| Mouse 1079 | Ts65Dn |
1080 rows × 1 columns
It is easy to combine column and row indexing.
df['Genotype'][1:5] # rows 1-4 in Genotype column
Mouse 1 Control Mouse 2 Control Mouse 3 Control Mouse 4 Control Name: Genotype, dtype: object
df[['Genotype', 'BDNF']][-9::3] # last 9 rows every third for two columns
| Genotype | BDNF | |
|---|---|---|
| Mouse 1071 | Ts65Dn | 0.290576 |
| Mouse 1074 | Ts65Dn | 0.254439 |
| Mouse 1077 | Ts65Dn | 0.234118 |
Exercise¶
Get the class column for the mice in rows 2-4.
Exercise Key¶
Get the class column for the mice in rows 2-4.
df[['class']][2:5]
| class | |
|---|---|
| Mouse 2 | c-CS-m |
| Mouse 3 | c-CS-m |
| Mouse 4 | c-CS-m |
Exercise¶
Get the class and Treatment columns for the mice in rows 2-4.
Exercise Key¶
Get the class and Treatment columns for the mice in rows 2-4.
df[['class', 'Treatment']][2:5]
| class | Treatment | |
|---|---|---|
| Mouse 2 | c-CS-m | Memantine |
| Mouse 3 | c-CS-m | Memantine |
| Mouse 4 | c-CS-m | Memantine |
iloc --> slice like NumPy¶
iloc[rows, columns]: 2D slice just like numpy
df.iloc[3:9:2, 5:10]
| NR2A | pAKT | pBRAF | pCAMKII | pCREB | |
|---|---|---|---|---|---|
| Mouse 3 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 |
| Mouse 5 | 4.807635 | 0.218578 | 0.176233 | 2.141282 | 0.195188 |
| Mouse 7 | 4.268735 | 0.214834 | 0.179668 | 2.007985 | 0.189803 |
loc --> slice by name¶
loc[rows, columns]
first:last: Note that both the first and last are included (finally, sanity!).
df.loc['Mouse 1077':'Mouse 1079', 'Genotype':'class']
| Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|
| Mouse 1077 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | Ts65Dn | Saline | S/C | t-SC-s |
List of [rows,...], [columns,...] in any order.
df.loc[['Mouse 3', 'Mouse 82', 'Mouse 500'], ['Genotype', 'BDNF']]
| Genotype | BDNF | |
|---|---|---|
| Mouse 3 | Control | 0.358626 |
| Mouse 82 | Control | 0.299948 |
| Mouse 500 | Control | 0.295143 |
Exercise¶
Use loc to get the class and Treatment for the mice from Mouse 53 to Mouse 60.
Exercise Key¶
Use loc to get the class and Treatment for the mice from Mouse 53 to Mouse 60.
df.loc['Mouse 53':'Mouse 60', ['class', 'Treatment']]
| class | Treatment | |
|---|---|---|
| Mouse 53 | c-CS-m | Memantine |
| Mouse 54 | c-CS-m | Memantine |
| Mouse 55 | c-CS-m | Memantine |
| Mouse 56 | c-CS-m | Memantine |
| Mouse 57 | c-CS-m | Memantine |
| Mouse 58 | c-CS-m | Memantine |
| Mouse 59 | c-CS-m | Memantine |
| Mouse 60 | c-CS-m | Memantine |
Select rows with a logical array¶
df['Genotype']
Mouse 0 Control
Mouse 1 Control
Mouse 2 Control
Mouse 3 Control
Mouse 4 Control
...
Mouse 1075 Ts65Dn
Mouse 1076 Ts65Dn
Mouse 1077 Ts65Dn
Mouse 1078 Ts65Dn
Mouse 1079 Ts65Dn
Name: Genotype, Length: 1080, dtype: object
df['Genotype'] == 'Ts65Dn'
Mouse 0 False
Mouse 1 False
Mouse 2 False
Mouse 3 False
Mouse 4 False
...
Mouse 1075 True
Mouse 1076 True
Mouse 1077 True
Mouse 1078 True
Mouse 1079 True
Name: Genotype, Length: 1080, dtype: bool
has_down_syndrome = (df['Genotype'] == 'Ts65Dn')
# only rows for down syndrome mice
df[has_down_syndrome]
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 570 | 3414_1 | 0.838053 | 1.006324 | 0.364953 | 2.603061 | 4.207680 | 0.202548 | 0.164055 | 2.222619 | 0.230776 | ... | NaN | 0.400078 | 0.112587 | 0.107095 | 0.123739 | 1.829242 | Ts65Dn | Memantine | C/S | t-CS-m |
| Mouse 571 | 3414_2 | 0.938245 | 1.077099 | 0.366149 | 2.703612 | 4.484228 | 0.197465 | 0.169923 | 2.289908 | 0.249214 | ... | NaN | 0.402721 | 0.112256 | 0.105537 | 0.128380 | 1.831980 | Ts65Dn | Memantine | C/S | t-CS-m |
| Mouse 572 | 3414_3 | 0.956488 | 1.108441 | 0.384049 | 2.662564 | 4.606357 | 0.203799 | 0.171625 | 2.311949 | 0.253610 | ... | NaN | 0.403164 | 0.108300 | 0.114087 | 0.129905 | 1.809744 | Ts65Dn | Memantine | C/S | t-CS-m |
| Mouse 573 | 3414_4 | 0.731738 | 0.944241 | 0.324365 | 2.320243 | 3.621250 | 0.198649 | 0.162240 | 2.257041 | 0.203458 | ... | NaN | 0.388400 | 0.117319 | 0.115926 | 0.133304 | 1.795498 | Ts65Dn | Memantine | C/S | t-CS-m |
| Mouse 574 | 3414_5 | 0.815503 | 1.005562 | 0.335976 | 2.379290 | 3.641657 | 0.217870 | 0.164142 | 2.291124 | 0.209941 | ... | NaN | 0.374617 | 0.120861 | 0.110322 | 0.133941 | 1.823529 | Ts65Dn | Memantine | C/S | t-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
510 rows × 82 columns
got_saline = (df['Treatment'] == 'Saline')
# only rows for down syndrome mice that recieved saline treatment
df[has_down_syndrome & got_saline]
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 840 | 18899_1 | 0.506200 | 0.696046 | 0.316019 | 2.204591 | 4.154323 | 0.254859 | 0.180295 | 2.557473 | 0.192694 | ... | NaN | 0.397663 | 0.155484 | NaN | 0.187052 | 1.357802 | Ts65Dn | Saline | C/S | t-CS-s |
| Mouse 841 | 18899_2 | 0.523760 | 0.746212 | 0.324897 | 2.285640 | 4.322314 | 0.268767 | 0.194387 | 2.648244 | 0.198864 | ... | NaN | 0.388920 | 0.164507 | NaN | 0.182911 | 1.327418 | Ts65Dn | Saline | C/S | t-CS-s |
| Mouse 842 | 18899_3 | 0.518612 | 0.733233 | 0.356137 | 2.330148 | 4.631455 | 0.272468 | 0.186284 | 2.624078 | 0.192656 | ... | NaN | 0.387212 | 0.156970 | NaN | 0.179930 | 1.267106 | Ts65Dn | Saline | C/S | t-CS-s |
| Mouse 843 | 18899_4 | 0.436986 | 0.626614 | 0.295108 | 2.008023 | 3.605088 | 0.258317 | 0.183562 | 2.648141 | 0.194521 | ... | NaN | 0.428589 | 0.187594 | NaN | 0.233242 | 1.329915 | Ts65Dn | Saline | C/S | t-CS-s |
| Mouse 844 | 18899_5 | 0.505599 | 0.719826 | 0.314600 | 2.194110 | 3.908544 | 0.281833 | 0.200539 | 2.804231 | 0.210701 | ... | NaN | 0.416420 | 0.182469 | NaN | 0.231235 | 1.297778 | Ts65Dn | Saline | C/S | t-CS-s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
240 rows × 82 columns
# MouseID and pCFOS levels
# for only down syndrome mice that recieved saline treatment
df[has_down_syndrome & got_saline][['MouseID', 'pCFOS']]
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | NaN |
| Mouse 841 | 18899_2 | NaN |
| Mouse 842 | 18899_3 | NaN |
| Mouse 843 | 18899_4 | NaN |
| Mouse 844 | 18899_5 | NaN |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
Missing values¶
df2 = df[has_down_syndrome & got_saline][['MouseID', 'pCFOS']]
df2
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | NaN |
| Mouse 841 | 18899_2 | NaN |
| Mouse 842 | 18899_3 | NaN |
| Mouse 843 | 18899_4 | NaN |
| Mouse 844 | 18899_5 | NaN |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
Get rid of any mice with missing data.
df2.dropna()
| MouseID | pCFOS | |
|---|---|---|
| Mouse 855 | 3476_1 | 0.109931 |
| Mouse 856 | 3476_2 | 0.123052 |
| Mouse 857 | 3476_3 | 0.114269 |
| Mouse 858 | 3476_4 | 0.111655 |
| Mouse 859 | 3476_5 | 0.123541 |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
210 rows × 2 columns
Replace missing values with zeros.
df2.fillna(0)
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | 0.000000 |
| Mouse 841 | 18899_2 | 0.000000 |
| Mouse 842 | 18899_3 | 0.000000 |
| Mouse 843 | 18899_4 | 0.000000 |
| Mouse 844 | 18899_5 | 0.000000 |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
Note that generally operations do NOT alter the dataframe.
Despite having run the dropna and fillna methods, df2 still has its original missing values.
df2
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | NaN |
| Mouse 841 | 18899_2 | NaN |
| Mouse 842 | 18899_3 | NaN |
| Mouse 843 | 18899_4 | NaN |
| Mouse 844 | 18899_5 | NaN |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
To keep the results of an operation you generally have two options:
- Store the result in a new dataframe.
- Apply the operation "inplace".
df3 = df2.fillna(0)
df3
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | 0.000000 |
| Mouse 841 | 18899_2 | 0.000000 |
| Mouse 842 | 18899_3 | 0.000000 |
| Mouse 843 | 18899_4 | 0.000000 |
| Mouse 844 | 18899_5 | 0.000000 |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
df2.fillna(-1, inplace=True)
df2
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | 18899_1 | -1.000000 |
| Mouse 841 | 18899_2 | -1.000000 |
| Mouse 842 | 18899_3 | -1.000000 |
| Mouse 843 | 18899_4 | -1.000000 |
| Mouse 844 | 18899_5 | -1.000000 |
| ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.183324 |
| Mouse 1076 | J3295_12 | 0.175674 |
| Mouse 1077 | J3295_13 | 0.158296 |
| Mouse 1078 | J3295_14 | 0.196296 |
| Mouse 1079 | J3295_15 | 0.187556 |
240 rows × 2 columns
Logical mask for whether or not a data value is missing.
df2 = df[has_down_syndrome & got_saline][['MouseID', 'pCFOS']]
df2.isnull()
| MouseID | pCFOS | |
|---|---|---|
| Mouse 840 | False | True |
| Mouse 841 | False | True |
| Mouse 842 | False | True |
| Mouse 843 | False | True |
| Mouse 844 | False | True |
| ... | ... | ... |
| Mouse 1075 | False | False |
| Mouse 1076 | False | False |
| Mouse 1077 | False | False |
| Mouse 1078 | False | False |
| Mouse 1079 | False | False |
240 rows × 2 columns
The number of missing values in each column.
# sum applied to each column
df.isnull().sum()
MouseID 0
DYRK1A 3
ITSN1 3
BDNF 3
NR1 3
..
CaNA 0
Genotype 0
Treatment 0
Behavior 0
class 0
Length: 82, dtype: int64
missing = df.isnull().sum()
missing.plot();
Import and configure matplotlib¶
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
plt.style.use('bmh')
plt.rcParams['figure.figsize'] = [12,4]
plt.rcParams['lines.linewidth'] = 1.5
Pandas plots use matplotlib, so all styling will be applied.
missing.plot();
missing.plot.bar();
missing > 10
MouseID False
DYRK1A False
ITSN1 False
BDNF False
NR1 False
...
CaNA False
Genotype False
Treatment False
Behavior False
class False
Length: 82, dtype: bool
df.loc[:,missing > 10]
| ELK | Bcatenin | BAD | BCL2 | pCFOS | H3AcK18 | EGR1 | H3MeK4 | |
|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 1.866358 | 3.037621 | 0.122652 | NaN | 0.108336 | 0.114783 | 0.131790 | 0.128186 |
| Mouse 1 | 1.761047 | 2.921882 | 0.116682 | NaN | 0.104315 | 0.111974 | 0.135103 | 0.131119 |
| Mouse 2 | 1.765544 | 2.944136 | 0.118508 | NaN | 0.106219 | 0.111883 | 0.133362 | 0.127431 |
| Mouse 3 | 1.286277 | 2.500204 | 0.132781 | NaN | 0.111262 | 0.130405 | 0.147444 | 0.146901 |
| Mouse 4 | 1.324695 | 2.456560 | 0.129954 | NaN | 0.110694 | 0.118481 | 0.140314 | 0.148380 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | 0.896430 | 1.828208 | 0.190483 | NaN | 0.183324 | 0.318782 | 0.204660 | 0.328327 |
| Mouse 1076 | 0.958879 | 1.883370 | 0.190463 | NaN | 0.175674 | 0.325639 | 0.200415 | 0.293435 |
| Mouse 1077 | 0.762892 | 1.494208 | 0.216682 | NaN | 0.158296 | 0.321306 | 0.229193 | 0.355213 |
| Mouse 1078 | 0.781893 | 1.571820 | 0.222263 | NaN | 0.196296 | 0.335936 | 0.251317 | 0.365353 |
| Mouse 1079 | 0.884735 | 1.742768 | 0.227606 | NaN | 0.187556 | 0.335062 | 0.252995 | 0.365278 |
1080 rows × 8 columns
You might think about not using the handful of features with a lot of missing values for your analysis.
# drop columns with a lot of missing data
dfok = df.loc[:,missing <= 10]
dfok
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | Ubiquitin | pGSK3B_Tyr216 | SHH | pS6 | SYP | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 1.044979 | 0.831557 | 0.188852 | 0.106305 | 0.427099 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 1.009883 | 0.849270 | 0.200404 | 0.106592 | 0.441581 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.996848 | 0.846709 | 0.193685 | 0.108303 | 0.435777 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.990225 | 0.833277 | 0.192112 | 0.103184 | 0.391691 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.997775 | 0.878668 | 0.205604 | 0.104784 | 0.434154 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 1.261651 | 0.962942 | 0.275547 | 0.115806 | 0.374088 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 1.254872 | 0.983690 | 0.283207 | 0.113614 | 0.375259 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 1.242248 | 0.976609 | 0.290843 | 0.118948 | 0.422121 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 1.301071 | 0.989286 | 0.306701 | 0.125295 | 0.397676 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 1.267120 | 1.020383 | 0.292330 | 0.118899 | 0.420347 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 74 columns
Exercise¶
Get the MouseID for all mice missing pCFOS measurements.
Exercise Key¶
Get the MouseID for all mice missing pCFOS measurements.
missing_pCFOS = df['pCFOS'].isnull()
df['MouseID'][missing_pCFOS]
Mouse 15 311_1
Mouse 16 311_2
Mouse 17 311_3
Mouse 18 311_4
Mouse 19 311_5
...
Mouse 1060 J1291_11
Mouse 1061 J1291_12
Mouse 1062 J1291_13
Mouse 1063 J1291_14
Mouse 1064 J1291_15
Name: MouseID, Length: 75, dtype: object
Add columns¶
df['HasDownSyndrome'] = has_down_syndrome
df
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | HasDownSyndrome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m | False |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m | False |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m | False |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m | False |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s | True |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s | True |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s | True |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s | True |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s | True |
1080 rows × 83 columns
Remove columns¶
del df['HasDownSyndrome']
df
| MouseID | DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mouse 0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
| Mouse 1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
| Mouse 2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
| Mouse 3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
| Mouse 4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mouse 1075 | J3295_11 | 0.254860 | 0.463591 | 0.254860 | 2.092082 | 2.600035 | 0.211736 | 0.171262 | 2.483740 | 0.207317 | ... | 0.183324 | 0.374088 | 0.318782 | 0.204660 | 0.328327 | 1.364823 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1076 | J3295_12 | 0.272198 | 0.474163 | 0.251638 | 2.161390 | 2.801492 | 0.251274 | 0.182496 | 2.512737 | 0.216339 | ... | 0.175674 | 0.375259 | 0.325639 | 0.200415 | 0.293435 | 1.364478 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1077 | J3295_13 | 0.228700 | 0.395179 | 0.234118 | 1.733184 | 2.220852 | 0.220665 | 0.161435 | 1.989723 | 0.185164 | ... | 0.158296 | 0.422121 | 0.321306 | 0.229193 | 0.355213 | 1.430825 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1078 | J3295_14 | 0.221242 | 0.412894 | 0.243974 | 1.876347 | 2.384088 | 0.208897 | 0.173623 | 2.086028 | 0.192044 | ... | 0.196296 | 0.397676 | 0.335936 | 0.251317 | 0.365353 | 1.404031 | Ts65Dn | Saline | S/C | t-SC-s |
| Mouse 1079 | J3295_15 | 0.302626 | 0.461059 | 0.256564 | 2.092790 | 2.594348 | 0.251001 | 0.191811 | 2.361816 | 0.223632 | ... | 0.187556 | 0.420347 | 0.335062 | 0.252995 | 0.365278 | 1.370999 | Ts65Dn | Saline | S/C | t-SC-s |
1080 rows × 82 columns
Visualize columns¶
df['pCFOS'].plot();
Plots use Matplotlib.
df['pCFOS'].plot()
plt.ylabel('pCFOS')
plt.title('pCFOS column data');
df[['pCREB', 'pCFOS']].plot()
plt.ylabel('pCFOS')
plt.title('Each trace is a column');
Group rows & compute something for each group¶
Also see pivot_table
df.groupby('Genotype')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11eafcd70>
df.groupby('Genotype').mean(numeric_only=True)
| DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | pELK | ... | SHH | BAD | BCL2 | pS6 | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genotype | |||||||||||||||||||||
| Control | 0.404756 | 0.575927 | 0.321132 | 2.333654 | 3.984687 | 0.229578 | 0.181515 | 3.505182 | 0.207285 | 1.443155 | ... | 0.226943 | 0.157057 | 0.135478 | 0.123365 | 0.132923 | 0.460376 | 0.156779 | 0.189647 | 0.198516 | 1.324037 |
| Ts65Dn | 0.449480 | 0.663394 | 0.316791 | 2.256362 | 3.685690 | 0.237204 | 0.182220 | 3.573004 | 0.218520 | 1.412411 | ... | 0.226378 | 0.158766 | 0.134017 | 0.119460 | 0.128882 | 0.430088 | 0.183324 | 0.176156 | 0.211409 | 1.353148 |
2 rows × 77 columns
genotype_means = df.groupby('Genotype').mean(numeric_only=True)
# plot defaults to plotting each column as a line
genotype_means.plot(legend=None);
genotype_means.T
| Genotype | Control | Ts65Dn |
|---|---|---|
| DYRK1A | 0.404756 | 0.449480 |
| ITSN1 | 0.575927 | 0.663394 |
| BDNF | 0.321132 | 0.316791 |
| NR1 | 2.333654 | 2.256362 |
| NR2A | 3.984687 | 3.685690 |
| ... | ... | ... |
| SYP | 0.460376 | 0.430088 |
| H3AcK18 | 0.156779 | 0.183324 |
| EGR1 | 0.189647 | 0.176156 |
| H3MeK4 | 0.198516 | 0.211409 |
| CaNA | 1.324037 | 1.353148 |
77 rows × 2 columns
genotype_means.T.plot();
genotype_means.T.plot.bar();
Nested groups¶
df.groupby(['Genotype', 'Treatment']).sum(numeric_only=True)
| DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | pELK | ... | SHH | BAD | BCL2 | pS6 | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Genotype | Treatment | |||||||||||||||||||||
| Control | Memantine | 113.048904 | 163.342224 | 94.524479 | 679.107326 | 1165.193447 | 70.677721 | 55.763691 | 1147.877081 | 60.994917 | 415.632666 | ... | 67.707033 | 36.093774 | 27.603222 | 37.183411 | 36.413589 | 138.641528 | 39.322980 | 43.772525 | 35.955030 | 380.042435 |
| Saline | 117.662182 | 164.936101 | 88.520562 | 651.075739 | 1106.078423 | 60.181752 | 47.699587 | 850.076614 | 57.157384 | 406.965410 | ... | 61.650598 | 31.754763 | 27.265559 | 33.134565 | 35.364678 | 123.772534 | 33.579453 | 41.568827 | 38.488287 | 374.658381 | |
| Ts65Dn | Memantine | 128.135872 | 184.111650 | 85.562397 | 617.758208 | 1028.994680 | 65.171495 | 50.619970 | 998.872821 | 58.749817 | 397.631829 | ... | 60.426136 | 29.642812 | 28.212642 | 33.140195 | 32.913052 | 116.010262 | 34.909445 | 37.637040 | 41.285369 | 361.054093 |
| Saline | 99.750587 | 152.228856 | 75.050752 | 526.217556 | 839.650309 | 55.091079 | 41.765377 | 812.639988 | 52.040005 | 318.460521 | ... | 55.026843 | 39.420299 | 24.054167 | 27.784312 | 27.016986 | 103.334536 | 44.836425 | 36.348656 | 50.677370 | 329.051542 |
4 rows × 77 columns
groups = df.groupby(['Genotype', 'Treatment']).sum(numeric_only=True)
groups.T
| Genotype | Control | Ts65Dn | ||
|---|---|---|---|---|
| Treatment | Memantine | Saline | Memantine | Saline |
| DYRK1A | 113.048904 | 117.662182 | 128.135872 | 99.750587 |
| ITSN1 | 163.342224 | 164.936101 | 184.111650 | 152.228856 |
| BDNF | 94.524479 | 88.520562 | 85.562397 | 75.050752 |
| NR1 | 679.107326 | 651.075739 | 617.758208 | 526.217556 |
| NR2A | 1165.193447 | 1106.078423 | 1028.994680 | 839.650309 |
| ... | ... | ... | ... | ... |
| SYP | 138.641528 | 123.772534 | 116.010262 | 103.334536 |
| H3AcK18 | 39.322980 | 33.579453 | 34.909445 | 44.836425 |
| EGR1 | 43.772525 | 41.568827 | 37.637040 | 36.348656 |
| H3MeK4 | 35.955030 | 38.488287 | 41.285369 | 50.677370 |
| CaNA | 380.042435 | 374.658381 | 361.054093 | 329.051542 |
77 rows × 4 columns
groups.T.plot();
groups.T.plot.bar();
groups.T[:20].plot.bar();
Sort rows¶
df.sort_values(by='BDNF')[['MouseID', 'Genotype', 'BDNF']]
| MouseID | Genotype | BDNF | |
|---|---|---|---|
| Mouse 181 | 3413_2 | Control | 0.115181 |
| Mouse 192 | 3413_13 | Control | 0.185980 |
| Mouse 974 | 3425_15 | Ts65Dn | 0.194160 |
| Mouse 193 | 3413_14 | Control | 0.197271 |
| Mouse 190 | 3413_11 | Control | 0.198159 |
| ... | ... | ... | ... |
| Mouse 32 | 320_3 | Control | 0.470056 |
| Mouse 375 | 3497_1 | Control | 0.497160 |
| Mouse 987 | 3426_13 | Ts65Dn | NaN |
| Mouse 988 | 3426_14 | Ts65Dn | NaN |
| Mouse 989 | 3426_15 | Ts65Dn | NaN |
1080 rows × 3 columns
# One reason why it is beneficial to attach indexes to each row
# irrespective of their order is it's easy to keep track of
# specific subjects no matter how we rearrange the data.
df.sort_values(by='BDNF')['MouseID']['Mouse 974']
'3425_15'
Changing the original DataFrame¶
Everything we've done so far has not altered the original DataFrame.
But what if we want to alter it?
- Option 1: Assign a
copyof whatever we did to anew DataFrame. - Option 2: Edit the original DataFrame directily by telling Pandas to perform actions
inplace.
# df2 is a copy of the sorted df, df is unchanged
df2 = df.sort_values(by='BDNF')
# df2 is itself resorted
df2.sort_values(by='pCFOS', inplace=True)
Editing the data frame¶
Most column manipulations work just like NumPy arrays.
df2['ITSN1'] = df2['DYRK1A'] * 100
df2['BDNF'] = 8
df2[['DYRK1A', 'ITSN1', 'BDNF']]
| DYRK1A | ITSN1 | BDNF | |
|---|---|---|---|
| Mouse 149 | 0.472849 | 47.284879 | 8 |
| Mouse 887 | 0.880699 | 88.069855 | 8 |
| Mouse 272 | 0.376922 | 37.692195 | 8 |
| Mouse 885 | 0.787914 | 78.791400 | 8 |
| Mouse 140 | 0.633407 | 63.340684 | 8 |
| ... | ... | ... | ... |
| Mouse 50 | 0.457992 | 45.799236 | 8 |
| Mouse 21 | 0.702608 | 70.260829 | 8 |
| Mouse 46 | 0.516108 | 51.610811 | 8 |
| Mouse 45 | 0.560614 | 56.061437 | 8 |
| Mouse 47 | 0.507321 | 50.732128 | 8 |
1080 rows × 3 columns
df2.loc['Mouse 140', 'BDNF'] = 100
df2[['BDNF']]
| BDNF | |
|---|---|
| Mouse 149 | 8 |
| Mouse 887 | 8 |
| Mouse 272 | 8 |
| Mouse 885 | 8 |
| Mouse 140 | 100 |
| ... | ... |
| Mouse 50 | 8 |
| Mouse 21 | 8 |
| Mouse 46 | 8 |
| Mouse 45 | 8 |
| Mouse 47 | 8 |
1080 rows × 1 columns
df2.loc['Mouse 21':'Mouse 45', 'BDNF'] = 500
df2[['BDNF']]
| BDNF | |
|---|---|
| Mouse 149 | 8 |
| Mouse 887 | 8 |
| Mouse 272 | 8 |
| Mouse 885 | 8 |
| Mouse 140 | 100 |
| ... | ... |
| Mouse 50 | 8 |
| Mouse 21 | 500 |
| Mouse 46 | 500 |
| Mouse 45 | 500 |
| Mouse 47 | 8 |
1080 rows × 1 columns
NumPy --> Pandas¶
import numpy as np
a = np.random.random([3,4])
a
array([[0.70046587, 0.86081322, 0.19431979, 0.15695987],
[0.67005485, 0.87417845, 0.72893385, 0.00119436],
[0.2056958 , 0.34201637, 0.72300174, 0.0150288 ]])
af = pd.DataFrame(a)
af
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| 0 | 0.700466 | 0.860813 | 0.194320 | 0.156960 |
| 1 | 0.670055 | 0.874178 | 0.728934 | 0.001194 |
| 2 | 0.205696 | 0.342016 | 0.723002 | 0.015029 |
af.columns = ['A', 'B', 'C', 'D']
af
| A | B | C | D | |
|---|---|---|---|---|
| 0 | 0.700466 | 0.860813 | 0.194320 | 0.156960 |
| 1 | 0.670055 | 0.874178 | 0.728934 | 0.001194 |
| 2 | 0.205696 | 0.342016 | 0.723002 | 0.015029 |
Pandas --> NumPy¶
af
| A | B | C | D | |
|---|---|---|---|---|
| 0 | 0.700466 | 0.860813 | 0.194320 | 0.156960 |
| 1 | 0.670055 | 0.874178 | 0.728934 | 0.001194 |
| 2 | 0.205696 | 0.342016 | 0.723002 | 0.015029 |
af.to_numpy()
array([[0.70046587, 0.86081322, 0.19431979, 0.15695987],
[0.67005485, 0.87417845, 0.72893385, 0.00119436],
[0.2056958 , 0.34201637, 0.72300174, 0.0150288 ]])
Exercise¶
Get the BDNF and ITSN1 columns for the original dataframe df as a $N \times 2$ NumPy array.
Exercise Key¶
Get the BDNF and ITSN1 columns for the original dataframe df as a $N \times 2$ NumPy array.
df[['BDNF', 'ITSN1']].to_numpy()
array([[0.4301753 , 0.74719322],
[0.41177034, 0.68906355],
[0.41830878, 0.7302468 ],
...,
[0.23411809, 0.39517937],
[0.24397413, 0.41289438],
[0.25656431, 0.46105919]])
import seaborn as sns
sns.relplot(x="BDNF", y="pCAMKII", data=df);
sns.catplot(x="Genotype", y="pCAMKII", data=df);
sns.catplot(x="Genotype", y="pCAMKII", kind="box", data=df);
sns.catplot(x="Genotype", y="pCAMKII", hue="Treatment", kind="box", data=df);
sns.catplot(x="Genotype", y="pCAMKII", hue="Treatment", col="Behavior", kind="box", data=df);
sns.catplot(x="Genotype", y="pCAMKII", hue="Treatment", col="Behavior", kind="swarm", data=df);
sns.catplot(x="Genotype", y="pCAMKII", row="Treatment", col="Behavior", kind="swarm", data=df);
Exercise¶
Plot box plots comparing BDDNF expression levels for each behavior pardigm under each treatment protocol.
Exercise Key¶
Plot box plots comparing BDDNF expression levels for each behavior pardigm under each treatment protocol.
sns.catplot(x="Behavior", y="BDNF", hue="Treatment", kind="box", data=df);
sns.pairplot(df.loc[:,'NR2A':'pCAMKII']);
Correlations amongst columns¶
df.corr(numeric_only=True)
| DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | pELK | ... | SHH | BAD | BCL2 | pS6 | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DYRK1A | 1.000000 | 0.959512 | 0.359433 | 0.294526 | 0.325063 | -0.180959 | -0.093737 | -0.180169 | 0.047299 | 0.791173 | ... | -0.199363 | -0.253175 | -0.167358 | -0.326577 | -0.228488 | -0.047246 | -0.384816 | -0.323397 | -0.473734 | 0.476033 |
| ITSN1 | 0.959512 | 1.000000 | 0.459884 | 0.422492 | 0.426182 | -0.147775 | -0.076507 | -0.132870 | 0.171122 | 0.780926 | ... | -0.200833 | -0.327136 | -0.257745 | -0.256573 | -0.309593 | 0.042422 | -0.347339 | -0.399856 | -0.511433 | 0.517782 |
| BDNF | 0.359433 | 0.459884 | 1.000000 | 0.805784 | 0.749993 | 0.317537 | 0.390530 | 0.246760 | 0.603857 | 0.451598 | ... | -0.080776 | -0.186508 | -0.146526 | 0.127703 | -0.254218 | 0.450433 | -0.135637 | -0.239907 | -0.244732 | 0.262578 |
| NR1 | 0.294526 | 0.422492 | 0.805784 | 1.000000 | 0.873889 | 0.211540 | 0.244224 | 0.301151 | 0.597443 | 0.416629 | ... | -0.141423 | -0.386577 | -0.321790 | 0.264181 | -0.374932 | 0.473366 | -0.234171 | -0.319672 | -0.351958 | 0.205463 |
| NR2A | 0.325063 | 0.426182 | 0.749993 | 0.873889 | 1.000000 | 0.110172 | 0.111143 | 0.280733 | 0.392695 | 0.409502 | ... | -0.259166 | -0.449268 | -0.297453 | 0.262492 | -0.434530 | 0.400526 | -0.314210 | -0.346987 | -0.454366 | 0.187292 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| SYP | -0.047246 | 0.042422 | 0.450433 | 0.473366 | 0.400526 | 0.248531 | 0.284492 | 0.244965 | 0.413060 | 0.039426 | ... | 0.056826 | 0.052303 | 0.039724 | 0.423405 | -0.039379 | 1.000000 | 0.031898 | 0.098065 | -0.010347 | 0.246107 |
| H3AcK18 | -0.384816 | -0.347339 | -0.135637 | -0.234171 | -0.314210 | 0.338456 | 0.301049 | 0.193369 | 0.235879 | -0.259559 | ... | 0.241714 | 0.465087 | 0.379046 | 0.138683 | 0.286105 | 0.031898 | 1.000000 | 0.332770 | 0.769093 | -0.239835 |
| EGR1 | -0.323397 | -0.399856 | -0.239907 | -0.319672 | -0.346987 | 0.280476 | 0.349281 | 0.079262 | -0.117623 | -0.202839 | ... | 0.378905 | 0.805637 | 0.841650 | 0.239695 | 0.725289 | 0.098065 | 0.332770 | 1.000000 | 0.731871 | -0.470487 |
| H3MeK4 | -0.473734 | -0.511433 | -0.244732 | -0.351958 | -0.454366 | 0.383404 | 0.402397 | 0.065615 | 0.056099 | -0.277555 | ... | 0.324445 | 0.749225 | 0.712321 | 0.222595 | 0.558868 | -0.010347 | 0.769093 | 0.731871 | 1.000000 | -0.404005 |
| CaNA | 0.476033 | 0.517782 | 0.262578 | 0.205463 | 0.187292 | -0.238567 | -0.213691 | -0.375797 | 0.039517 | 0.233863 | ... | -0.123215 | -0.276626 | -0.299910 | -0.344369 | -0.302378 | 0.246107 | -0.239835 | -0.470487 | -0.404005 | 1.000000 |
77 rows × 77 columns
corr = df.corr(numeric_only=True)
plt.figure(figsize=[10,8])
sns.heatmap(corr);
Note that the corr() matrix does NOT contain any categorical variables!
# see heatmap.py module file
from heatmap import heatmap, corrplot
plt.figure(figsize=[12,12])
ax = corrplot(corr, size_scale=50)
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90, ha='center');
import dython
result = dython.nominal.associations(df.iloc[:,1:], compute_only=True) # ignore MouseID column
result['corr']
/opt/miniconda3/envs/neu365/lib/python3.12/site-packages/dython/nominal.py:708: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. inf_nan.loc[columns[i], columns[j]] = _inf_nan_str(ij) /opt/miniconda3/envs/neu365/lib/python3.12/site-packages/dython/nominal.py:709: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. inf_nan.loc[columns[j], columns[i]] = _inf_nan_str(ji)
| DYRK1A | ITSN1 | BDNF | NR1 | NR2A | pAKT | pBRAF | pCAMKII | pCREB | pELK | ... | pCFOS | SYP | H3AcK18 | EGR1 | H3MeK4 | CaNA | Genotype | Treatment | Behavior | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DYRK1A | 1.000000 | 0.959255 | 0.367778 | 0.306508 | 0.335411 | -0.147453 | -0.058022 | -0.164829 | 0.073716 | 0.792209 | ... | -0.177847 | -0.045088 | -0.227613 | -0.043643 | -0.225390 | 0.472178 | 0.084061 | 0.006329 | 0.507845 | 0.538328 |
| ITSN1 | 0.959255 | 1.000000 | 0.473075 | 0.437863 | 0.440253 | -0.104235 | -0.028655 | -0.112148 | 0.202193 | 0.785017 | ... | -0.202560 | 0.044856 | -0.191353 | -0.025779 | -0.225843 | 0.510772 | 0.164719 | 0.024292 | 0.485934 | 0.535317 |
| BDNF | 0.367778 | 0.473075 | 1.000000 | 0.826391 | 0.762199 | 0.379601 | 0.456188 | 0.277158 | 0.645409 | 0.473381 | ... | -0.140422 | 0.433283 | -0.036878 | -0.060561 | -0.186493 | 0.241898 | 0.059476 | 0.045885 | 0.154089 | 0.310029 |
| NR1 | 0.306508 | 0.437863 | 0.826391 | 1.000000 | 0.876241 | 0.284832 | 0.327402 | 0.328389 | 0.640559 | 0.440822 | ... | -0.189127 | 0.454020 | -0.086095 | -0.028069 | -0.196479 | 0.187222 | 0.123151 | 0.045167 | 0.034652 | 0.298001 |
| NR2A | 0.335411 | 0.440253 | 0.762199 | 0.876241 | 1.000000 | 0.163371 | 0.173304 | 0.301790 | 0.431922 | 0.428873 | ... | -0.245247 | 0.395922 | -0.178156 | -0.071078 | -0.275716 | 0.178642 | 0.167971 | 0.017966 | 0.120696 | 0.358313 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CaNA | 0.472178 | 0.510772 | 0.241898 | 0.187222 | 0.178642 | -0.234543 | -0.208196 | -0.374794 | 0.030708 | 0.227569 | ... | -0.254194 | 0.246107 | -0.068643 | -0.045902 | -0.189644 | 1.000000 | 0.045850 | 0.125451 | 0.748772 | 0.781937 |
| Genotype | 0.084061 | 0.164719 | 0.059476 | 0.123151 | 0.167971 | 0.071788 | 0.006394 | 0.017890 | 0.144414 | 0.041299 | ... | 0.103784 | 0.227713 | 0.170673 | 0.028657 | 0.245490 | 0.045850 | 1.000000 | 0.000000 | 0.000000 | 0.997213 |
| Treatment | 0.006329 | 0.024292 | 0.045885 | 0.045167 | 0.017966 | 0.141788 | 0.195526 | 0.193406 | 0.058549 | 0.004628 | ... | 0.008524 | 0.010898 | 0.141047 | 0.061332 | 0.194178 | 0.125451 | 0.000000 | 1.000000 | 0.000000 | 0.997213 |
| Behavior | 0.507845 | 0.485934 | 0.154089 | 0.034652 | 0.120696 | 0.322728 | 0.281845 | 0.475608 | 0.208261 | 0.290978 | ... | 0.261077 | 0.072460 | 0.162500 | 0.176697 | 0.187514 | 0.748772 | 0.000000 | 0.000000 | 1.000000 | 0.997213 |
| class | 0.538328 | 0.535317 | 0.310029 | 0.298001 | 0.358313 | 0.409956 | 0.372766 | 0.560796 | 0.292861 | 0.331422 | ... | 0.357843 | 0.280641 | 0.351382 | 0.205604 | 0.373513 | 0.781937 | 0.997213 | 0.997213 | 0.997213 | 1.000000 |
81 rows × 81 columns
plt.figure(figsize=[12,12])
ax = corrplot(result['corr'], size_scale=50)
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=90, ha='center');
Exercise¶
Which protein is most correlated with class?
Plot categorical jitter plots comparing the expression level of this protein for each class.
Exercise Key¶
Which protein is most correlated with class?
Plot categorical jitter plots comparing the expression level of this protein for each class.
sns.catplot(x="class", y="SOD1", data=df);
From the above exercise, it is clear that SOD1 primarily differs based on the Behavior conditioning paradigm.
sns.catplot(x="Behavior", y="SOD1", hue="Genotype", data=df);
import hvplot.pandas
df['pCFOS'].hvplot()
df[['pCREB', 'pCFOS']].hvplot()